5 Applications in Natural Language Processing
5.1 Background
We first review the background of the three topics covered in this section: quantization-aware training for low-bit language models, post-training quantization for low-bit language models, and binary language models.
5.1.1 Quantization-Aware Training (QAT) for Low-Bit Large Language Models
Large pre-trained language models have achieved remarkable success in various natural language processing tasks, largely by scaling up model size and computation overhead [227, 54, 21], which makes it prohibitive to deploy these language models on many resource-constrained devices. To make the deployment of existing language models possible, various model compression techniques have been proposed, such as pruning [64, 172, 244], knowledge distillation [107, 217], weight-sharing [51, 125, 98], dynamic computation with adaptive depth or width [88, 255, 298], and network quantization [285, 221, 195, 6]. Among these techniques, network quantization reduces both the model size and the computation overhead without modifying the network architecture. It has therefore attracted extensive attention, and many methods have been explored to quantize language models.
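
To make this concrete, the following is a minimal sketch of uniform 8-bit weight quantization on a single linear layer, illustrating how quantization shrinks storage (and enables low-bit arithmetic) without touching the network architecture. The layer size and bit-width are arbitrary choices for illustration, not taken from any particular method.

import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)
w = layer.weight.data                            # full-precision (fp32) weights

scale = w.abs().max() / 127                      # symmetric per-tensor scale
w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale               # stands in for w at inference

fp32_bytes = w.numel() * 4
int8_bytes = w_int8.numel() * 1
print(f"storage: {fp32_bytes} B -> {int8_bytes} B (4x smaller)")
print(f"max quantization error: {(w - w_dequant).abs().max():.6f}")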
To date, most language model quantization methods follow quantization-aware training (QAT), in which the full-precision model undergoes an entire training process with quantization simulated in the forward pass. In practice, such QAT-based methods usually perform better than other quantization paradigms, such as post-training quantization (PTQ).
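
As a rough illustration of how QAT works, the sketch below fake-quantizes the weights of a linear layer during the forward pass and uses a straight-through estimator (STE) so that gradients still update the full-precision latent weights. The 4-bit width, layer shapes, and toy training step are assumptions made for brevity rather than the setup of any specific method.

import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    # Symmetric uniform quantizer; gradients pass through unchanged (STE).
    @staticmethod
    def forward(ctx, w, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-8            # per-tensor scale
        return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                       # straight-through estimator

class QuantLinear(nn.Linear):
    # Linear layer whose weights are fake-quantized in every forward pass.
    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, 4)       # simulate 4-bit weights
        return nn.functional.linear(x, w_q, self.bias)

# Toy training step: the forward pass sees quantized weights, while the
# full-precision latent weights are updated by back-propagation.
model = nn.Sequential(QuantLinear(64, 64), nn.ReLU(), QuantLinear(64, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 64), torch.randint(0, 8, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

In a real QAT pipeline, this training loop runs over the entire training set (often combined with distillation losses), which is exactly what makes QAT expensive compared with the post-training approach discussed next.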
5.1.2 Post-Training Quantization (PTQ) for Low-Bit Large Language Models
Although QAT yields satisfactory performance for large language models compared with post-training quantization (PTQ), which relies only on a small calibration set to perform quantization, it often suffers from several issues. Specifically, QAT usually conducts end-to-end back-propagation training over the whole training set, which is slow to train, memory demanding, and data consuming. These issues can sometimes be prohibitive for industrial language models.
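
For contrast, the following is a minimal sketch of a PTQ calibration pass: a trained model is run on a small unlabeled calibration set to record activation ranges, which are then converted into 8-bit scale and zero-point parameters. The model, calibration data, and min-max observer here are illustrative assumptions rather than a particular published PTQ method.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).eval()
calibration_set = [torch.randn(32, 64) for _ in range(8)]   # small unlabeled set

# 1. Calibration: record the observed range of each linear layer's input.
ranges, hooks = {}, []
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        def hook(mod, inputs, output, name=name):
            x = inputs[0].detach()
            lo, hi = ranges.get(name, (x.min(), x.max()))
            ranges[name] = (torch.minimum(lo, x.min()), torch.maximum(hi, x.max()))
        hooks.append(module.register_forward_hook(hook))

with torch.no_grad():
    for batch in calibration_set:
        model(batch)
for h in hooks:
    h.remove()

# 2. Convert observed ranges into 8-bit affine quantization parameters.
def qparams(lo, hi, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (hi - lo).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - lo / scale).round().clamp(qmin, qmax)
    return scale, zero_point

for name, (lo, hi) in ranges.items():
    scale, zp = qparams(lo, hi)
    print(f"{name}: scale={scale.item():.5f}, zero_point={int(zp)}")

Because only these few forward passes over the calibration set are required, PTQ avoids the end-to-end retraining cost that QAT incurs.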
Compared with PTQ methods, QAT mainly has drawbacks in three aspects: training time, memory demand, and data consumption. First, QAT conducts training over the entire training set, so it takes much more time than PTQ, which only processes a small calibration set. Moreover, recent QAT methods [6, 285] further combine two-stage knowledge distillation [107], which can